Skip to content

MDEV-21423 - lock-free trx_sys get performance regression cause by lf_find and ut_delay#5043

Merged
svoj merged 1 commit into
MariaDB:11.8from
svoj:pr-main-MDEV-21423
May 27, 2026
Merged

MDEV-21423 - lock-free trx_sys get performance regression cause by lf_find and ut_delay#5043
svoj merged 1 commit into
MariaDB:11.8from
svoj:pr-main-MDEV-21423

Conversation

@svoj

@svoj svoj commented May 5, 2026

Copy link
Copy Markdown
Contributor

Under high concurrency, MVCC snapshot creation may spend a significant amount of time in lf_hash_iterate()/lfind() while collecting active read-write transaction identifiers. This overhead is particularly visible in sysbench oltp_read_write with transaction-isolation=READ-COMMITTED.

Iteration cost becomes high due to significant TLB thrashing and poor memory locality in this hot code path because snapshot creation touches many rw_trx_hash nodes distributed across memory, including dummy nodes that are irrelevant for snapshot construction. In addition, traversing LF_HASH requires issuing heavyweight memory barriers.

This is a performance regression after svoj@53cc9aa, which changed MVCC snapshot creation to scan LF_HASH instead of maintaining a global sorted vector protected by the global mutex.

Add trx_sys.rw_trx_ids, a compact traversal-friendly vector of active read-write transaction identifiers and serialization numbers optimized for MVCC snapshot creation, while rw_trx_hash remains responsible for transaction lookup.

The vector may contain empty slots corresponding to idle or read-only transactions that currently do not own a read-write transaction identifier. Such slots are skipped by snapshot creation.

This reduces traversal overhead during MVCC snapshot creation by improving memory locality, reducing TLB pressure, and avoiding repeated memory barriers required for rw_trx_hash traversal.

@svoj svoj requested a review from dr-m May 5, 2026 22:10

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new rw_trx_ids_t class to manage read-write transaction IDs and serialization numbers, moving them from the hash elements into a centralized vector within the transaction system. This involves significant updates to transaction registration, deregistration, and snapshotting logic across InnoDB. Feedback highlights a critical thread-safety issue where the ids vector is accessed without a read lock during potential reallocations, which could lead to memory corruption. Additionally, the use of memset to initialize a synchronization primitive was identified as unsafe and should be removed in favor of the standard initialization call.

Comment thread storage/innobase/include/trx0sys.h Outdated
Comment thread storage/innobase/include/trx0sys.h Outdated
Comment thread storage/innobase/include/trx0sys.h Outdated
Comment thread storage/innobase/include/trx0sys.h Outdated
Comment thread storage/innobase/include/trx0sys.h Outdated
Comment thread storage/innobase/log/log0log.cc Outdated
@svoj svoj force-pushed the pr-main-MDEV-21423 branch from 3b998f0 to 6d2f280 Compare May 6, 2026 07:11
Comment thread storage/innobase/include/trx0sys.h Outdated
Comment thread storage/innobase/include/trx0sys.h
Comment thread storage/innobase/include/trx0sys.h
Comment thread storage/innobase/include/trx0sys.h
Comment thread storage/innobase/include/trx0sys.h Outdated
Comment thread storage/innobase/include/trx0sys.h Outdated
Comment thread storage/innobase/include/trx0sys.h
Comment thread storage/innobase/include/trx0trx.h Outdated
@svoj svoj force-pushed the pr-main-MDEV-21423 branch 2 times, most recently from 0de23ce to ac20be7 Compare May 6, 2026 14:21

@dr-m dr-m left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very promising. The reason why I won’t give an approval yet is that I didn’t review all details of this thoroughly, especially around startup and shutdown.

This needs to be tested, both for performance and stability. Please coordinate with the testers on this.

Comment thread storage/innobase/include/trx0trx.h Outdated
@mariadb-SaahilAlam

Copy link
Copy Markdown

Test run completed on  ac20be7
No issues were found during testing

@svoj svoj force-pushed the pr-main-MDEV-21423 branch 3 times, most recently from aaeeae2 to 6cf1240 Compare May 16, 2026 09:19

@dr-m dr-m left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since my previous review, the single std::vector has been split into two: rw_trx_vector::ids comprising the transaction start and end identifiers, and rw_trx_vector::trxs comprising the pointers to the transaction objects. I understood that this yielded the best performance, presumably because when iterating the identifiers, we would wasting cache line capacity on trx_t* and possible padding.

I only have a minor comment about documenting and enforcing the lifetime of trx_t::rw_trx_ids_slot.

I would request to rebase this to the 11.8 branch, because that is the branch where all InnoDB performance improvements done so far have landed. Also, the target customer for this improvement is moving to the MariaDB Enterprise Server 11.8 release.

Comment on lines +632 to +633
/** trx_sys.rw_trx_ids index, protected by trx_sys.rw_trx_ids.latch */
uint32_t rw_trx_ids_slot;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does this value change, and what is its lifetime? I see that we are not initializing this in trx_init() but directly in TrxFactory::init(). There’s no revision to assert_freed() or trx_t::free() either. Normally, we assert that all fields have been cleared, and we mark the memory inaccessible for AddressSanitizer.

I initially suspected that similar to max_inactive_id, this could be a "cache" that can remain valid across transaction restarts that happen to use the same memory buffer. But, it turns out that rw_trx_vector::register_trx() and rw_trx_vector::deregister_trx() guarantee that this field will be std::numeric_limits<uint32_t>::max() whenever a transaction is being allocated or freed.

@svoj svoj May 27, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets valid value at trx_create() / trx_sys_t::register_trx() and the value remains until trx_t::free() / trx_sys_t::deregister_trx().

trx_t::assert_freed() is called from trx_t::commit() / trx_t::commit_cleanup(), which is transaction end. But rw_trx_ids_slot has to live until session end.

trx_init() is not suitable either, trx_create() is the beginning of the scope.

The sole purpose of std::numeric_limits<uint32_t>::max() was to assert that rw_trx_vector is not being accessed with uninitialized rw_trx_ids_slot. But it looks somewhat excessive and can be removed.

Comment on lines 1115 to 1116
ut_ad(state == TRX_STATE_NOT_STARTED);
ut_ad(!id);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Between these lines we are missing the following:

    ut_ad(rw_trx_ids_slot == std::numeric_limits<uint32_t>::max());

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trx_t::assert_freed() is called from trx_t::commit() / trx_t::commit_cleanup(), which is transaction end. But rw_trx_ids_slot has to live until session end.

Comment thread storage/innobase/trx/trx0trx.cc
…_find and ut_delay

Under high concurrency, MVCC snapshot creation may spend a significant
amount of time in lf_hash_iterate()/l_find() while collecting active
read-write transaction identifiers. This overhead is particularly
visible in sysbench oltp_read_write with
transaction-isolation=READ-COMMITTED.

Iteration cost becomes high due to significant TLB thrashing and poor
memory locality in this hot code path because snapshot creation touches
many rw_trx_hash nodes distributed across memory, including dummy
nodes that are irrelevant for snapshot construction. In addition,
traversing LF_HASH requires issuing heavyweight memory barriers.

This is a performance regression after 53cc9aa, which changed
MVCC snapshot creation to scan LF_HASH instead of maintaining a global
sorted vector protected by the global mutex.

Add trx_sys.rw_trx_ids, a compact traversal-friendly vector of active
read-write transaction identifiers and serialization numbers optimized
for MVCC snapshot creation, while rw_trx_hash remains responsible for
transaction lookup.

The vector may contain empty slots corresponding to idle or read-only
transactions that currently do not own a read-write transaction
identifier. Such slots are skipped by snapshot creation.

This reduces traversal overhead during MVCC snapshot creation by
improving memory locality, reducing TLB pressure, and avoiding repeated
memory barriers required for rw_trx_hash traversal.
@svoj svoj force-pushed the pr-main-MDEV-21423 branch from 6cf1240 to 429f108 Compare May 27, 2026 17:44
@CLAassistant

CLAassistant commented May 27, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@svoj svoj changed the base branch from main to 11.8 May 27, 2026 17:45
@svoj svoj merged commit 241050c into MariaDB:11.8 May 27, 2026
16 of 18 checks passed
@dr-m dr-m mentioned this pull request May 28, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants